Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

mkdir checkM_out

mkdir checkM_out/healthy
checkm lineage_wf \
    -t 4 \
    -x fa \
    -f checkM_out/healthy_checkm_report.txt \
    binning/healthy \
    checkM_out/healthy

mkdir checkM_out/moderate
checkm lineage_wf \
    -t 4 \
    -x fa \
    -f checkM_out/moderate_checkm_report.txt \
    binning/moderate \
    checkM_out/moderate

mkdir checkM_out/severe
checkm lineage_wf \
    -t 4 \
    -x fa \
    -f checkM_out/severe_checkm_report.txt \
    binning/severe \
    checkM_out/severe
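Because the three commands differ only in the sample name, they can also be wrapped in a loop. The following is a minimal sketch rather than part of the original workflow: the function name run_checkm_all is hypothetical, and it assumes the bins are stored as binning/<sample>/*.fa, as in the commands above.

```shell
#!/usr/bin/env bash
# Sketch: run CheckM's lineage_wf once per sample instead of repeating the command.
# run_checkm_all is a hypothetical helper name; the paths follow the text above.

run_checkm_all() {
    local out_root="checkM_out"   # same output root as in the text
    local sample
    mkdir -p "$out_root"
    for sample in "$@"; do
        # Skip samples that have no bin directory instead of letting CheckM fail.
        if [ ! -d "binning/$sample" ]; then
            echo "skipping $sample: binning/$sample not found" >&2
            continue
        fi
        mkdir -p "$out_root/$sample"
        checkm lineage_wf \
            -t 4 \
            -x fa \
            -f "$out_root/${sample}_checkm_report.txt" \
            "binning/$sample" \
            "$out_root/$sample"
    done
}
```

After the function is defined, running run_checkm_all healthy moderate severe should produce the same three reports as the individual commands.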

The report (Figure 8.6) shows the bin ID, marker lineage (taxonomic rank), # genomes
(number of genomes used to infer the marker sets), # markers (number of marker genes),
# marker sets (number of sets into which the inferred markers are grouped), 0–5+ (number
of marker genes identified zero, one, two, three, four, or five or more times in the bin),
completeness (estimated from the presence/absence of the expected marker genes),
contamination (estimated from marker genes found in more than one copy), and strain
heterogeneity (high heterogeneity indicates that the contamination is from one or more
closely related organisms).
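To make the completeness column more concrete, the sketch below assumes that completeness is the average, over all marker sets, of the fraction of each set's marker genes found in the bin, expressed as a percentage. This is our reading of how CheckM defines the value, not something shown in the report itself. The function name marker_set_completeness and its two-column input format are hypothetical and are not produced by CheckM.

```shell
#!/usr/bin/env bash
# Sketch: percentage completeness as the mean per-set fraction of markers found.
# Hypothetical input, one line per marker set:
#   <number of markers in the set> <number of those markers found in the bin>

marker_set_completeness() {
    awk 'NF >= 2 && $1 > 0 { sum += $2 / $1; n++ }
         END {
             if (n > 0) printf "%.2f\n", 100 * sum / n
             else       print "NA"
         }' "$@"
}
```

Under this assumption, the completeness value need not equal the plain ratio of identified markers to total markers, because every marker set carries the same weight regardless of its size.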

In Figure 8.6, for the moderate sample, note that for the bin “moderate.4”, 5449 bacterial
genomes were used to infer 104 marker genes; only 6 of these genes were identified in the
bin, the completeness is 10.34%, and there was no contamination. For more details about
the use of CheckM and its report, refer to the program home page at
“https://ecogenomics.github.io/CheckM/”.
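When there are many bins, it can help to filter the report by completeness and contamination instead of reading it by eye. The sketch below assumes a tab-separated report, which lineage_wf writes when --tab_table is added next to -f, and it assumes the header row contains columns named Completeness and Contamination. The function name filter_bins is hypothetical. The default thresholds (at least 50% completeness, at most 10% contamination) are only a commonly used starting point and should be adjusted to the study.

```shell
#!/usr/bin/env bash
# Sketch: list bins that pass completeness/contamination thresholds.
# Usage: filter_bins REPORT.tsv [MIN_COMPLETENESS] [MAX_CONTAMINATION]
# Assumes a tab-separated CheckM report with a header row.

filter_bins() {
    local report="$1"
    local min_comp="${2:-50}"   # assumed default threshold, adjust as needed
    local max_cont="${3:-10}"   # assumed default threshold, adjust as needed
    awk -F'\t' -v min="$min_comp" -v max="$max_cont" '
        NR == 1 {
            # Locate the columns by header name instead of hard-coding positions.
            for (i = 1; i <= NF; i++) {
                if ($i == "Completeness")  comp = i
                if ($i == "Contamination") cont = i
            }
            if (!comp || !cont) {
                print "expected Completeness and Contamination columns" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Print bin ID, completeness, and contamination for bins that pass.
        ($comp + 0) >= min && ($cont + 0) <= max { print $1 "\t" $comp "\t" $cont }
    ' "$report"
}
```

For example, if the moderate sample were rerun with --tab_table and the report saved under a hypothetical name such as checkM_out/moderate_checkm_report.tsv, then filter_bins checkM_out/moderate_checkm_report.tsv 90 5 would list only the bins with at least 90% completeness and at most 5% contamination. A bin such as “moderate.4”, with a completeness of 10.34, would be excluded.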

FIGURE 8.6 Genome completeness evaluation report generated by CheckM.

8.2.9 Prediction of Protein-Coding Region

This step annotates the single genomes recovered by binning, or the metagenomic
assemblies, with potential gene locations by predicting the open reading frames (ORFs). The gene